!pip install geopandas
!pip install plotly
!pip install rtree
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.ticker  # needed for the FuncFormatter calls below
import matplotlib.pyplot as plt
import geopandas as gpd
import plotly.graph_objects as go
from google.colab import drive
df = pd.read_csv('/Users/zhangjing/Desktop/cis545/final/US_Accidents_Dec19.csv')
pd.set_option('display.max_columns', None)
# clean time-related variables: parse Start_Time once and reuse it
dt_index = pd.DatetimeIndex(df['Start_Time'])
df['year'] = dt_index.year
#df = df[df['year'].isin([2016, 2017, 2018, 2019])] # drop other years
df['month'] = dt_index.month
df['hour'] = dt_index.hour
df['weekday'] = dt_index.dayofweek
df['week'] = dt_index.week
# map numeric weekday (0=Mon ... 6=Sun) to a label
df['weekday2'] = df['weekday'].map(
    {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thur', 4: 'Fri', 5: 'Sat', 6: 'Sun'})
df['ref'] = 1
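A quick standalone sanity check of the time features above, using two known dates (a Monday morning and a Saturday afternoon); the label mapping mirrors the one applied to the full dataframe:

```python
import pandas as pd

sample = pd.Series(pd.to_datetime(['2019-07-01 08:30', '2019-07-06 17:00']))
labels = sample.dt.dayofweek.map(
    {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thur', 4: 'Fri', 5: 'Sat', 6: 'Sun'})
print(list(sample.dt.hour))   # [8, 17]
print(list(labels))           # ['Mon', 'Sat']
```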
The plot of accidents from 2016 to 2019 shows that most accidents happened on highways and on cities' main roads. California has the highest number of accidents over this period.
plt.figure(figsize=(50,30))
plt.title('Plot of Traffic Accident from 2016 to 2019', fontsize = 50)
plt.xlabel('Longitude',fontsize=40)
plt.ylabel('Latitude',fontsize=40)
plt.plot(df.Start_Lng, df.Start_Lat, ".", alpha=0.5)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.show()
count_by_state = df['State'].value_counts().reset_index()
count_by_state.columns = ['State', 'Value']
fig = go.Figure(data=go.Choropleth(
locations = count_by_state.State,
z = count_by_state.Value,
locationmode = 'USA-states',
colorscale = 'tempo',
colorbar_title = "Count of Accidents",
))
fig.update_layout(
title_text = 'US Traffic Accidents by State 2016-2019',
geo_scope='usa', # limit map scope to the USA
)
fig.show()
This is the plot of the number of accidents in each week from 2016 to 2019. Its fluctuation indicates that the number of accidents changes periodically; for example, there is a stark drop in the last week of each year. As for severity, Severity 2 accidents fluctuate significantly from week to week and show an obvious increase from July 2019 (the 31st week) onward. The other severity types, especially severe accidents (Severity 3 & 4), do not show dramatic changes in volume from 2016 to 2019.
Now, let's zoom into the data by month and week to see whether there are regular patterns.
fig, ax = plt.subplots(figsize=(13,6))
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
week_year = df.groupby(['year','week','Severity']).count()['ref'].unstack()
week_year.plot(ax=ax)
ax.set_xlabel('Week', fontsize=14)
ax.set_ylabel('Count of Accidents',fontsize=14)
The left diagram shows that August to December is the peak crash period within a year. The right diagram shows that Severity 2 varies considerably over time, especially after July, while the other severity types are relatively flat by comparison.
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,4))
# month & total count
ax1.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
time_day = df['month'].value_counts().reset_index()
time_day.columns = ['month', 'count']
time_day.sort_values('month', inplace=True)
ax1.plot('month', 'count', data = time_day)
ax1.set_title('Month & Total Accident Count (2016-2019)', fontsize=14)
ax1.set_xticks(list(time_day['month']))
ax1.set_xlabel('Month', fontsize=14)
ax1.set_ylabel('Count of Traffic Accidents', fontsize=14)
# month & severity
time_day = df.groupby(['month','Severity']).count()['ref'].unstack()
time_day.plot(ax=ax2)
ax2.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax2.set_xticks(list(time_day.index))
ax2.set_xlabel('Month', fontsize=14)
#ax2.set_ylabel('', fontsize=14)
ax2.set_title('Month & Accident Severity', fontsize=14)
plt.show()
The left plot shows that most accidents happened during weekday rush hours (e.g. 7:00-8:00 and 17:00-18:00), which is easy to understand since traffic volume is also huge at those times. There are clearly fewer traffic accidents at the weekend, and most of them are concentrated between 11:00 and 15:00. However, accident severity does not show a strong relationship with time of day (as the right diagram shows), because each severity curve follows a tendency similar to the total count.
import matplotlib
fig, (ax,ax1) = plt.subplots(1,2,figsize=(13,5))
# time & total
df['ref'] = 1
time_day = df.groupby(['hour','weekday2']).count()['ref'].unstack()
time_day.plot(ax=ax)
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_xticks(list(time_day.index))
ax.set_xlabel('Time of Day', fontsize=14)
ax.set_ylabel('Number of Traffic Accidents', fontsize=14)
ax.set_title('Count of Traffic Accident by Time', fontsize=14)
# time & severity
time_day = df.groupby(['hour','Severity']).count()['ref'].unstack()
time_day.plot(ax=ax1)
ax1.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax1.set_xticks(list(time_day.index))
ax1.set_xlabel('Time of Day', fontsize=14)
ax1.set_title('Time & Accident Severity', fontsize=14)
Before diving into the weather-related features, it is important to check the number of NAs in each feature. This diagram shows that some weather-related variables, e.g. precipitation and wind chill, are of limited use because their NA percentage reaches 30% or even 60%. Replacing the NAs with the mean or with 0 is not suitable given such a large share of missing values, so those variables will be dropped.
NA_count = df.isna().sum().reset_index()
NA_count.columns = ['index', 'Value']
NA_count['percent'] = (NA_count['Value']*100)/len(df)
NA_count = NA_count[NA_count['percent']>0].sort_values('percent')
fig, ax = plt.subplots(figsize=(10,6))
NA_count.plot.barh('index','percent', width=0.8,color='c',align='center',linewidth=1.2,ax=ax)
ax.set_xlabel('NA percentage', fontsize=14)
ax.set_ylabel('Name of Feature',fontsize=14)
ax.set_title('Percentage of NA in each feature',fontsize=16)
fig, ax = plt.subplots(figsize=(13,6))
weather_condition = df['Weather_Condition'].value_counts().sort_values(ascending=False).head(30)
weather_condition.plot.bar(width=0.8,color='c',align='center',linewidth=1.2)
ax.get_yaxis().set_major_formatter(
matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_xlabel('Weather condition', fontsize=14)
ax.set_ylabel('Count of Traffic Accident',fontsize=14)
ax.set_title('Traffic Accident & Weather Condition',fontsize=16)
Let's start with the weather condition description. This bar chart counts the number of traffic accidents under different weather conditions. Most accidents occur on Clear, Mostly Cloudy, and Overcast days, but that does not mean those weather conditions cause accidents. A more accurate method is to focus on percentages, i.e. the percentage of each severity level under each weather condition. In that case, we can figure out which types of weather tend to result in which types of accidents.
weather_count = df.groupby(['Weather_Condition','Severity']).count()['ref'].reset_index()
weather_df2 = weather_count[weather_count['ref']>20]
weather_count2 = weather_df2.groupby(['Weather_Condition']).sum()[['ref']]
weather_df3 = pd.merge(weather_df2, weather_count2, how="left", on ='Weather_Condition')
weather_df3['percent']= (weather_df3['ref_x']*100)/weather_df3['ref_y']
weather_df3
plt.figure(figsize=(25, 10))
# pivot to Weather_Condition x Severity so that the bars actually stack
weather_pivot = weather_df3.pivot(index='Weather_Condition', columns='Severity',
                                  values='percent').fillna(0)
# order conditions by their Severity 2 share, largest first
weather_pivot = weather_pivot.sort_values(by=2, ascending=False)
bottom = np.zeros(len(weather_pivot))
for severity in weather_pivot.columns:
    plt.bar(weather_pivot.index, weather_pivot[severity],
            bottom=bottom, label='Severity %d' % severity)
    bottom = bottom + weather_pivot[severity].values
plt.xticks(rotation=90,fontsize=15)
plt.yticks(fontsize=14)
plt.xlabel("Weather Condition (count>20)", fontsize=18)
plt.ylabel("Severity Percentage", fontsize=18)
plt.title("Severity & Weather Condition", fontsize=25)
plt.legend(fontsize =20)
As mentioned before, this stacked bar chart shows the percentage of each accident severity by weather condition (only weather conditions with more than 20 cases are included). Across the whole dataset, about 67% of accidents are Severity 2, so it is reasonable that Severity 2 is the majority for most weather conditions. Notably, though, the share of severe accidents (Severity 3 and Severity 4) is significantly higher during extreme weather, for example Blowing Snow, Light Ice Pellets, Light Freezing Rain, Heavy Snow, and T-Storm/Windy. Because extreme weather is rare, the accidents that occur in it are also few compared with accidents in normal weather; however, once an accident does occur in extreme weather, it is highly likely to be a severe one.
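The per-condition severity shares computed above with groupby and merge can equivalently be obtained in one step with `pd.crosstab(..., normalize='index')`. A minimal check on made-up toy data (the column names mirror the real dataset, the values do not):

```python
import pandas as pd

toy = pd.DataFrame({
    'Weather_Condition': ['Clear', 'Clear', 'Clear', 'Heavy Snow', 'Heavy Snow'],
    'Severity':          [2,       2,       3,       3,            4],
})
# each row is normalised to sum to 100%
share = pd.crosstab(toy['Weather_Condition'], toy['Severity'],
                    normalize='index') * 100
print(share)   # Clear: ~66.7% Severity 2; Heavy Snow: 50/50 Severity 3/4
```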
#df['precipitation'] = df['Precipitation(in)'].fillna(0, inplace=False)
df = df.dropna(subset=['City','Temperature(F)','Visibility(mi)','Pressure(in)','Humidity(%)',
'Weather_Condition'])
df['tem_qcut'] = pd.qcut(df['Temperature(F)'], 15)
df['humidity_qcut'] = pd.qcut(df['Humidity(%)'], 15)
df2 = df.dropna(subset=['Wind_Speed(mph)'])
df2['wind_speed_qcut'] = pd.qcut(df2['Wind_Speed(mph)'], 15, duplicates='drop')
df['visi_qcut'] = pd.cut(df['Visibility(mi)'], 15, duplicates='drop')
tem_qcut = df.groupby(['tem_qcut']).mean()['Severity'].reset_index()
humidity_qcut = df.groupby(['humidity_qcut']).mean()['Severity'].reset_index()
visi_qcut = df.groupby(['visi_qcut']).mean()['Severity'].reset_index()
wind_speed_qcut = df2.groupby(['wind_speed_qcut']).mean()['Severity'].reset_index()
fig, (ax,ax1,ax2,ax3) = plt.subplots(4,figsize=(10,14))
fig.subplots_adjust(hspace=0.3)
tem_qcut.plot('tem_qcut','Severity',ax=ax)
#ax.set_title("Mean Severity & Temperature(F)", fontsize=14)
ax.set_xlabel("Temperature category (quantile)(F)", fontsize=14)
ax.set_ylabel("Mean Severity", fontsize=14)
ax.set_xticklabels(tem_qcut['tem_qcut'])
humidity_qcut.plot('humidity_qcut','Severity',ax=ax1)
#ax1.set_title("Mean Severity & Humidity(%)", fontsize=14)
ax1.set_xlabel("Humidity category (%) (quantile)", fontsize=14)
ax1.set_ylabel("Mean Severity", fontsize=14)
ax1.set_xticklabels(humidity_qcut['humidity_qcut'])
wind_speed_qcut.plot('wind_speed_qcut','Severity',ax=ax2)
ax2.set_xlabel("Wind Speed category (mph)(quantile)", fontsize=14)
ax2.set_ylabel("Mean Severity", fontsize=14)
ax2.set_xticklabels(wind_speed_qcut['wind_speed_qcut'])
visi_qcut.plot('visi_qcut','Severity',ax=ax3)
ax3.set_xlabel("Visibility category (mi)", fontsize=14)
ax3.set_ylabel("Mean Severity", fontsize=14)
ax3.set_xticklabels(visi_qcut['visi_qcut'],rotation=90)
plt.show()
After dropping NAs for the useful weather-related features, I explored the relationship between accident severity and Temperature / Humidity / Wind Speed / Visibility, as this series of diagrams shows. I first cut each variable into 15 quantile bins, then grouped by bin and calculated the mean severity.
The plots show that as Temperature increases, mean severity decreases. Humidity can be viewed as a substitute for precipitation (since the precipitation variable has too many NAs), and higher Humidity is associated with more severe accidents. The plots do not show a strong or obvious relationship between wind speed and severity, which makes sense to some extent, as wind speed does not directly affect the vehicle's capability or the driver's sight. The relationship with Visibility is also vague.
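What the qcut step does, shown on toy data (a standalone sketch, not the project data): `pd.qcut` bins a variable into equal-frequency quantile bins, after which a groupby gives the mean severity per bin.

```python
import pandas as pd

toy = pd.DataFrame({'temp': [10, 20, 30, 40, 50, 60],
                    'Severity': [3, 3, 2, 2, 2, 2]})
toy['temp_bin'] = pd.qcut(toy['temp'], 3)          # 3 equal-sized bins
mean_sev = toy.groupby('temp_bin', observed=True)['Severity'].mean()
print(mean_sev)  # the lowest-temperature bin has the highest mean severity
```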
import shapely.speedups
shapely.speedups.enable()
import geopandas as gpd
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.Start_Lng, df.Start_Lat))
#grid
path ='/Users/zhangjing/Desktop/cis545/final/grid545_2.geojson'
grid = gpd.read_file(path)
# usa
path ='/Users/zhangjing/Desktop/cis545/final/545usa2.geojson'
usa2 = gpd.read_file(path)
path ='/Users/zhangjing/Desktop/cis545/final/545philly.geojson'
philly = gpd.read_file(path)
# note: the {'init': 'epsg:4326'} form is deprecated in newer geopandas/pyproj
gdf.crs = 'EPSG:4326'
grid.crs = 'EPSG:4326'
philly.crs = 'EPSG:4326'
grid1 = grid
grid = grid.reset_index()
grid.rename(columns = {'index':'Grid_ID'}, inplace = True)
#grid
As mentioned before, this project divides the US into 10x10 km grid cells and then counts the number of accidents within each cell at a given time. The first plot shows the grid over the USA together with all accidents. The second plot is a zoomed-in version of the intersected grid (keeping only the cells that had accidents).
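The grid itself is loaded from a prepared geojson below and joined spatially; purely to illustrate the idea, here is a dependency-free sketch of snapping a lon/lat point to a ~10x10 km cell index. The `grid_id` helper and the equirectangular approximation (1 degree of latitude ~ 111 km) are my own toy assumptions, not the project's actual method.

```python
import math

CELL_KM = 10.0
KM_PER_DEG_LAT = 111.0  # rough equirectangular approximation

def grid_id(lng, lat):
    """Map a lon/lat point to an illustrative (col, row) ~10x10 km cell index."""
    km_per_deg_lng = KM_PER_DEG_LAT * math.cos(math.radians(lat))
    col = int(lng * km_per_deg_lng // CELL_KM)
    row = int(lat * KM_PER_DEG_LAT // CELL_KM)
    return col, row

# two points ~1 km apart fall in the same cell; points ~20 km apart do not
a = grid_id(-75.16, 39.95)   # near central Philadelphia
b = grid_id(-75.15, 39.95)
print(a == b)  # True: the two nearby points share a cell
```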
fig, ax = plt.subplots(figsize=(50,30))
#df.plot(ax =ax, markersize=3, color="blue")
plt.plot(df.Start_Lng, df.Start_Lat, ".", alpha=0.5)
grid.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.1)
plt.title('All 10*10km grid & all accidents', fontsize = 50)
# note: all_model is built in the modelling cells further below; run those first
select_grid = all_model['Grid_ID'].drop_duplicates(keep='first').reset_index()
select_grid['ref1'] = 1
select_grid["Grid_ID"] = select_grid["Grid_ID"].astype(int)
grid_select_sf = pd.merge(grid,select_grid, how='right', on=['Grid_ID'])
grid_select_sf = grid_select_sf[grid_select_sf.geometry.type == 'Polygon']
grid_select_sf
fig, ax = plt.subplots(figsize=(30,20))
#join_new.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.01)
philly.plot(ax=ax,facecolor="none",edgecolor='black',lw=5)
grid_select_sf.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.3)
plt.plot(df.Start_Lng, df.Start_Lat, ".",ms=0.5)
ax.set_xlim(-78, -72)
ax.set_ylim(39, 43)
plt.title('Selected 10*10km grid & all accidents & Philly boundary', fontsize = 30)
For the third plot, we zoomed into Philly to get a better sense of the grid and the accidents. The number of accidents within Philadelphia appears much lower than outside it, and Abington & Plymouth Meeting (northwest of Philadelphia) suffered severe traffic safety issues over the period. This might be attributed to county-level traffic regulation or to an issue with the data. I lean toward the latter, as the boundary effect is too sharp: the points inside and outside Philadelphia present two totally different patterns.
fig, ax = plt.subplots(figsize=(24,24))
#usa2.plot(ax=ax,facecolor="none", edgecolor='none',lw=0.7)
philly.plot(ax=ax,facecolor="none",edgecolor='black',lw=4)
plt.plot(df.Start_Lng, df.Start_Lat, ".")
grid.plot(ax=ax,facecolor="none", edgecolor='black',lw=0.5)
ax.set_xlim(-75.4, -74.9)
ax.set_ylim(39.8, 40.2)
plt.title('Zoom into Philadelphia', fontsize = 30)
plt.show()
import rtree
import shapely.speedups
shapely.speedups.enable()
# spatial join the accident points to the grid cells
join = gpd.sjoin(gdf, grid, how="left")
join
all_join2 = join.reset_index()
# add new ID
all_join2.rename(columns = {'index':'New_ID'}, inplace = True)
# round time to hour
all_join2['Time_round'] = pd.to_datetime(all_join2['Start_Time']).dt.floor('60T')
# drop useless features
all_model = all_join2.drop(['ID', 'Start_Lat','End_Lat','End_Lng','Start_Lng','Start_Time',
'End_Time','geometry','Source','TMC','Description','Number','Zipcode','Wind_Chill(F)',
'Country','Timezone','Airport_Code','Weather_Timestamp','Civil_Twilight','Astronomical_Twilight',
'index_right'], axis=1)
# dummy severity
all_model['Severity_1'] = np.where(all_model['Severity']==1, 1, 0)
all_model['Severity_2'] = np.where(all_model['Severity']==2, 1, 0)
all_model['Severity_3'] = np.where(all_model['Severity']==3, 1, 0)
all_model['Severity_4'] = np.where(all_model['Severity']==4, 1, 0)
all_model['Severity_3_4'] =all_model['Severity_3'] +all_model['Severity_4']
all_model['reference'] = all_model.Grid_ID.astype(str) + all_model.Time_round.astype(str)+all_model.Severity_3_4.astype(str)
# tackle NA
all_model['precipitation'] = all_model['Precipitation(in)'].fillna(0, inplace=False)
all_model = all_model.dropna(subset=['City','Temperature(F)','Wind_Speed(mph)','Pressure(in)','Humidity(%)',
'Weather_Condition','Visibility(mi)','Wind_Direction'])
all_model = all_model.drop(['Wind_Direction', 'Precipitation(in)','tem_qcut','humidity_qcut',
'visi_qcut'], axis=1)
NA_count = all_model.isna().sum().reset_index()
NA_count.columns = ['index', 'Value']
NA_count['percent'] = (NA_count['Value']*100)/len(all_model)
NA_count
# calculate means per reference group
pre = all_model.groupby(['reference']).mean()[['precipitation','Temperature(F)',
'Visibility(mi)','Humidity(%)',
'Distance(mi)']].reset_index()
pre.columns = ['reference', 'mean_precipitation','mean_Temp','mean_Visibility','mean_Humidity','mean_Distance']
#pre['reference'] = pre.Grid_ID.astype(str) + pre.Time_round.astype(str)+pre.Severity.astype(str)
# count severe accidents
count1 = all_model.groupby(['reference','Grid_ID','Severity_3_4','year','month','week','weekday','hour']).count()['ref'].reset_index()
#count1.sort_values(by=['ref'],ascending=False)
#pd.value_counts(count1['ref']).reset_index()
# count frequency of POIs within each grid
all_model.loc[:, 'Amenity':'Turning_Loop'] *=1
#select_grid = all_model['Grid_ID'].drop_duplicates(keep='first').reset_index()
count2 = all_model.groupby(['Grid_ID','Severity_3_4'
]).count().loc[:, 'Amenity':'Turning_Loop'].reset_index()
# select unique grids and its city & county & State
select_grid = all_model[['Grid_ID','State','City',]].drop_duplicates(subset='Grid_ID',keep='first').reset_index()
all_merge = pd.merge (count1,pre, how='right',on='reference')
all_merge = pd.merge (all_merge, select_grid[['Grid_ID','State','City']], how='left',on='Grid_ID')
all_merge = pd.merge (all_merge, count2, how='left',on=['Grid_ID','Severity_3_4'])
all_merge
d_weekday = pd.get_dummies(all_merge.weekday, prefix ='weekday')
d_year = pd.get_dummies(all_merge.year, prefix ='year')
#d_month = pd.get_dummies(all_merge.month, prefix ='month')
d_week = pd.get_dummies(all_merge.week, prefix ='week')
d_hour = pd.get_dummies(all_merge.hour, prefix ='hour')
d_state = pd.get_dummies(all_merge.State, prefix ='state')
#numeric = all_merge.drop(['reference','ref','State','weekday','year',
# 'month','week','hour','city',], axis=1)
all_c_n = pd.concat([all_merge, d_weekday, d_year,
    d_week, d_hour, d_state],axis = 1).drop(['reference','ref','State','weekday','year',
    'month','week','hour','City'], axis=1)
all_c_n
pd.set_option('display.max_rows', None)
all_c_n.dtypes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
label = all_c_n['Severity_3_4']
features = all_c_n.drop('Severity_3_4',axis = 1)
from sklearn.model_selection import train_test_split
# 80/20 train/test split
x_train, x_test, y_train, y_test = train_test_split(features, label,
                                                    test_size=0.20)
This project applied a random forest to train the model. After several attempts, the score is about 60%, with only 100 estimators and a max_depth of only 20. If we increase n_estimators and max_depth, the model's score will likely improve further. Due to limited time, laptop memory, and an unstable environment, I did not try other settings (but need to as a next step).
clf = RandomForestRegressor(n_estimators = 50, max_depth=15)
clf.fit(x_train,y_train)
#5:16 40min
y_pred = clf.predict(x_test)
clf.score(x_test,y_test)
clf2 = RandomForestRegressor(n_estimators = 100, max_depth=20)
clf2.fit(x_train,y_train)
y_pred = clf2.predict(x_test)
clf2.score(x_test,y_test)
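GridSearchCV is imported earlier but never used; a minimal sketch of how it could tune n_estimators and max_depth as the "next step" mentioned above. It is shown on tiny synthetic data (`X_demo`/`y_demo` are made up here so the sketch runs quickly); on the real features/label it would take far longer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(int)   # binary target, like Severity_3_4

param_grid = {'n_estimators': [25, 50], 'max_depth': [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```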
The plot shows the importance of each feature. Feature 0, i.e. Grid_ID, has the strongest predictive power. Although putting Grid_ID into the model as a numeric feature is not very accurate, it does take geographic location into account to some extent.
importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf2.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
plt.figure(figsize=(40,12))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(x_train.shape[1]), indices)
plt.xlim([-1, x_train.shape[1]])
plt.show()
Analysis/Prediction Unit: The purpose of this project is to figure out which areas in the USA are risky for drivers and to build a model to predict future risk. This project therefore cares more about clusters of accidents than about single cases (which might occur by chance), and applies a raster (10x10 km grid) as the basic unit for counting accidents.
Big data: First, the dataset has 3 million rows. Second, exploring the accidents' geographic locations is clearly helpful for prediction, but it is not easy to plot or manipulate a GeoDataFrame with 3 million rows. Even though this project uses a raster (cell side = 10 km, 30,000+ cells) to simplify this by counting accidents within each cell, the model also considers time-related variables, e.g. month, day of week, and hour, which further expand the data, since the model has to record the number of accidents within each cell at each time.
Difficulty in finding other sources: Finding other related variables is also not easy when the project covers the whole USA, for example county-level census data, neighbourhood-level POI (point of interest) cluster data, and road data (e.g. speed limits, road width, turning curve rate, etc.).
Given the above challenges, the next step is to zoom into one city or state: within a single city or state the data will be more consistent and accessible, and the data size will shrink considerably, freeing memory for other related features.
For this project, I used a raster rather than the streets themselves because of coding limitations, but the current grid cells seem too big for the model to be applied in reality. The next step is therefore to switch the analysis unit from the spatial grid to the street itself: one possible method is to cut all streets into 1 km segments and predict the probability of an accident in each segment at each hour (still a classification problem). In that case, a large number of interesting variables could be added to the model, such as road type, street width, speed limit, distance to the nearest intersection, and even the number of billboards along the 1 km segment (which may distract the driver).
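The proposed street-segmentation step can be sketched without any GIS library: given a street as a polyline in metre coordinates, walk along it and emit a breakpoint every 1 km. The `cut_line` helper below is a hypothetical illustration (real road geometries would first need projecting to a metric CRS).

```python
import math

def cut_line(points, step=1000.0):
    """Split a polyline (list of (x, y) in metres) every `step` metres.

    Returns the segment endpoints, including interpolated cut points."""
    out = [points[0]]
    carried = 0.0                      # distance walked since the last cut
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        while carried + seg >= step:   # a cut falls inside this edge
            t = (step - carried) / seg
            x0, y0 = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            seg -= step - carried
            carried = 0.0
            out.append((x0, y0))
        carried += seg
    out.append(points[-1])
    return out

# a straight 2.5 km street -> cut points at 1 km and 2 km, plus both ends
print(cut_line([(0.0, 0.0), (2500.0, 0.0)]))
```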